AI - Spam Text Detector
The Spam Text Detection project builds an AI system that automatically detects and filters out spam messages. Using the Naive Bayes algorithm, the system learns to identify unwanted text in emails, SMS, or other forms of communication. With the help of Python, Scikit-learn, and NLTK, we clean and process text data before training a model to classify whether a message is spam or not. The goal of this project is to save users time and effort by automating spam detection.
Technologies Used
- Python
- Scikit-learn
- NLTK (Natural Language Toolkit)
- Pandas
This guide walks you through building your own Spam Text Detection system from scratch. The process includes setting up the environment, preprocessing text data, training the Naive Bayes model, and using it to classify messages as spam or not spam. You can also follow along with a detailed YouTube video walkthrough that covers each step.

Why Build This AI?
The AI has several real-world uses, such as:
- Email Filtering: Automatically filters spam from inboxes in email clients.
- Cybersecurity: Helps block phishing attempts and malware by flagging harmful emails.
- Productivity Boost: Reduces the time spent on managing and cleaning up spam.
- SMS Filtering: Can be used to filter spam SMS messages in mobile applications.
- Customer Support: Filters out junk messages in customer service platforms, focusing agents on real inquiries.
Let's Build It
Import Necessary Libraries
- re and string: Used for text preprocessing, such as removing punctuation and unwanted characters.
- pandas: Helps with handling and processing the dataset in DataFrame format for easier manipulation.
- TfidfVectorizer: A feature extraction tool that converts text into numerical vectors, weighting each word by how often it appears in a message and how rare it is across the whole dataset.
- MultinomialNB: The Naive Bayes classifier used to categorize the text as spam or non-spam.
- train_test_split: A function to split the data into training and testing sets, enabling model evaluation.
# Step 1: Import necessary libraries
import re
import string
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
Manually Create a Small Custom Dataset (Spam vs Ham)
In this step, we manually create a small dataset of messages labeled as either spam (1) or ham (0):
- data dictionary: The dataset consists of two columns:
  - text: A list of messages, some of which are spam (e.g., prize offers, promotions), and others are non-spam (e.g., personal conversations, reminders).
  - label: A list of corresponding labels where 1 represents spam and 0 represents ham (non-spam).
- pd.DataFrame(data): The data dictionary is converted into a Pandas DataFrame, a structured table that makes the data easier to manipulate and process for training and testing the model.
# Step 2: Manually create a small custom dataset (Spam vs Ham)
data = {
"text": [
"Free entry in 2 a wkly comp to win FA Cup finals ticket",
"Hello, how are you doing?",
"Claim your free prize now by clicking the link",
"Are we still meeting for dinner?",
"This is your last chance to claim a prize",
"Reminder: Your meeting is tomorrow at 10 AM",
"You have an urgent message from the bank",
"Please call me when you get a chance",
"You have won a $1000 prize! Claim it now",
"Let's grab lunch tomorrow!",
"Important notice regarding your account",
"Congratulations! You've been selected for a special offer"
],
"label": [1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1] # 1 -> spam, 0 -> ham
}
# Convert into DataFrame
df = pd.DataFrame(data)
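Before training on a hand-built dataset like this, it can help to check how many messages fall into each class. A minimal sketch, using only the labels from the dataset above (the names `labels` and `class_counts` are illustrative and not part of the original code):

```python
import pandas as pd

# Labels copied from the dataset above: 1 -> spam, 0 -> ham
labels = [1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1]

# Count how many messages belong to each class
class_counts = pd.Series(labels).value_counts()
print(class_counts)
```

With 7 ham and 5 spam messages the classes are roughly balanced, though 12 examples are far too few for a dependable model.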
Preprocess the Text
In this step, we define a function to preprocess the text data by cleaning it. The function clean_text performs several tasks to make the text suitable for machine learning:
- Convert to lowercase: All characters are converted to lowercase to ensure consistency (e.g., "Free" and "free" are treated the same).
- Remove URLs: Any URLs (like "http://..." or "www...") are removed using the regular expression https?://\S+|www\.\S+. This runs before punctuation removal, because stripping punctuation first would delete the "://" and "." characters the pattern relies on.
- Remove numbers: Any digits in the text are removed using the regular expression \d+.
- Remove punctuation: Using string.punctuation, we remove all punctuation marks such as commas, periods, and exclamation points.
- Remove extra spaces: Any extra spaces between words are reduced to a single space, and leading/trailing spaces are stripped.
# Step 3: Preprocess the text
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # Remove URLs (before punctuation is stripped)
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # Remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

df["clean_text"] = df["text"].apply(clean_text)
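The order of the cleaning steps matters for URLs. The sketch below compares two orderings on a made-up sample message; the helper names `strip_urls_first` and `strip_punctuation_first` are hypothetical and exist only for this comparison:

```python
import re
import string

URL_PATTERN = r'https?://\S+|www\.\S+'
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_urls_first(text):
    # Remove URLs while "://" and "." are still present, then strip punctuation
    text = re.sub(URL_PATTERN, '', text.lower())
    text = text.translate(PUNCT_TABLE)
    return re.sub(r'\s+', ' ', text).strip()

def strip_punctuation_first(text):
    # Strip punctuation first; the URL pattern can no longer match afterwards
    text = text.lower().translate(PUNCT_TABLE)
    text = re.sub(URL_PATTERN, '', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "Claim now at http://example.com/win today!"
urls_first = strip_urls_first(sample)
punct_first = strip_punctuation_first(sample)
print(urls_first)   # claim now at today
print(punct_first)  # claim now at httpexamplecomwin today
```

When punctuation goes first, the URL survives as a single meaningless token that would end up in the vocabulary.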
Convert Text into Numerical Format
# Step 4: Convert text into numerical format
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(df["clean_text"])
y = df["label"]
Train a Naïve Bayes Model
In this step, we train the spam classifier using the Naïve Bayes algorithm, which is well-suited for text classification tasks. Here's a breakdown of the code:
- train_test_split(X, y, test_size=0.2, random_state=42):
  - This function splits the dataset into training and testing sets.
  - X represents the feature data (the numerical representation of the text).
  - y represents the labels (spam or ham).
  - test_size=0.2: Allocates 20% of the data for testing and 80% for training. With the 12 messages above, scikit-learn rounds the test set up, giving 3 test messages and 9 training messages.
  - random_state=42: Ensures reproducibility by setting a seed for the random splitting.
- MultinomialNB():
  - This initializes a Multinomial Naïve Bayes classifier, a variant suited to word-count or TF-IDF features. It assumes that the features (words) are conditionally independent given the class, a simplification that rarely holds exactly but tends to work well for text.
- model.fit(X_train, y_train):
  - The fit() method trains the Naïve Bayes model on the training data (X_train) and corresponding labels (y_train). This step allows the model to learn the relationships between the text features and the labels (spam or ham).
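The independence assumption can be written out as a decision rule. As a sketch in standard notation (the symbols $c$, $w_i$, and $n$ are introduced here for illustration), for a message containing the words $w_1, \dots, w_n$ the classifier picks the class $c$ (spam or ham) with the highest score:

```latex
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

Here $P(c)$ is estimated from the share of training messages in class $c$, and $P(w_i \mid c)$ from how heavily word $w_i$ features in that class's training messages. With TF-IDF input, the word weights act as fractional counts rather than whole-number counts.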
# Step 5: Train a Naïve Bayes model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = MultinomialNB()
model.fit(X_train, y_train)
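Splitting the data only pays off if the held-out messages are then used to score the model. The sketch below is a variation on the steps above rather than the original code: it wraps the vectorizer and classifier in a scikit-learn pipeline so that TF-IDF statistics are learned from the training messages only, passes `stratify` so both classes appear in each split, and reports accuracy. The names `texts`, `labels`, `pipeline`, and the split variables are illustrative, and the cleaning step is left out for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Same messages and labels as the dataset above (1 -> spam, 0 -> ham)
texts = [
    "Free entry in 2 a wkly comp to win FA Cup finals ticket",
    "Hello, how are you doing?",
    "Claim your free prize now by clicking the link",
    "Are we still meeting for dinner?",
    "This is your last chance to claim a prize",
    "Reminder: Your meeting is tomorrow at 10 AM",
    "You have an urgent message from the bank",
    "Please call me when you get a chance",
    "You have won a $1000 prize! Claim it now",
    "Let's grab lunch tomorrow!",
    "Important notice regarding your account",
    "Congratulations! You've been selected for a special offer",
]
labels = [1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1]

# Stratified split keeps both classes represented in train and test
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training messages only, then trains the classifier
pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
pipeline.fit(train_texts, train_labels)

# Score the model on messages it has not seen during training
predicted_labels = pipeline.predict(test_texts)
accuracy = accuracy_score(test_labels, predicted_labels)
print(f"Held-out messages: {len(test_labels)}, accuracy: {accuracy:.2f}")
```

With only three held-out messages, the accuracy can take just four values (0, 0.33, 0.67, or 1.0), so treat it as a smoke test rather than a measurement.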
User Input Spam Detection Function
This step defines a function that takes user input, processes it, and predicts whether it’s spam or not.
- User Input: Prompts the user to enter a message.
- Preprocessing: The input is cleaned using the clean_text() function (removes numbers, punctuation, and URLs).
- Prediction: The cleaned text is vectorized, then passed to the trained Naïve Bayes model to predict if it's spam (1) or not spam (0).
- Output: Displays the prediction ("🚫 Spam" or "✅ Not Spam").
# Step 6: Function for user input spam detection
def predict_spam():
    user_input = input("Enter text to check if it's spam or not: ")
    cleaned_input = clean_text(user_input)
    input_vector = vectorizer.transform([cleaned_input])
    prediction = model.predict(input_vector)[0]
    result = "🚫 Spam" if prediction == 1 else "✅ Not Spam"
    print("\n🔍 Prediction:", result)

predict_spam()
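Because predict_spam() reads from input(), it is awkward to call from other code or from tests. One possible alternative, sketched here with hypothetical names (`classify_message`, `toy_vectorizer`, `toy_model`, and a four-message toy training set taken from the dataset above), takes the message as an argument and also returns the model's estimated spam probability via `predict_proba`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training set drawn from the dataset above, just to make the sketch runnable
toy_texts = [
    "Claim your free prize now by clicking the link",
    "Are we still meeting for dinner?",
    "You have won a $1000 prize! Claim it now",
    "Please call me when you get a chance",
]
toy_labels = [1, 0, 1, 0]  # 1 -> spam, 0 -> ham

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_model = MultinomialNB()
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

def classify_message(text):
    """Return (label, spam_probability) for a single message."""
    vector = toy_vectorizer.transform([text])
    # Column of predict_proba that corresponds to the spam class (label 1)
    spam_column = list(toy_model.classes_).index(1)
    spam_probability = float(toy_model.predict_proba(vector)[0, spam_column])
    label = "Spam" if toy_model.predict(vector)[0] == 1 else "Not Spam"
    return label, spam_probability

label, probability = classify_message("Claim your free prize today")
print(label, round(probability, 2))
```

Returning the probability alongside the label lets a caller apply a stricter or looser threshold than the default decision.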
Spam message examples:
- Congratulations! You have won a free iPhone. Click here to claim your prize now!
- Exclusive offer! Buy one get one free on all purchases. Limited time only. Claim now!
Non-spam message examples:
- Hey John, let's catch up for coffee tomorrow at 5 PM.
- Can you send me the report before 3 PM today?